
fix(openai): self-heal stale Codex used% snapshots + lock semantics (#2994) #3039

Merged
Wei-Shaw merged 1 commit into Wei-Shaw:main from StarryKira:fix/issue-2994-codex-5h-used-percent-selfheal on Jun 5, 2026


Conversation

@StarryKira

Contributor

Problem (#2994)

Newly imported OpenAI/Codex OAuth accounts showed ~96–99% used in the 5h window even when nearly unused (a few cents of usage). The inflated value also tripped shouldAutoPauseOpenAIAccountByQuota, so the account was excluded from scheduling (in the issue's words, "subsequent requests can no longer be scheduled to this account").

Root cause was a 100 - usedPercent inversion in Normalize() (commit b65dde63, PR #2918). For a fresh account whose x-codex-secondary-used-percent ≈ 1, it stored 100 - 1 = 99 into codex_5h_used_percent. That inversion was already reverted in main (PR #2993). The stored value is now the correct "used %".
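
The arithmetic behind the inflated reading can be sketched in a few lines of Go. This is only an illustration: storeInverted and storeDirect are hypothetical names for the two semantics, and the real Normalize() is not reproduced here.

```go
package main

import "fmt"

// storeInverted mimics the buggy semantics described above: the header value
// is treated as "remaining %" and flipped. Hypothetical name, not the real code.
func storeInverted(usedPercent float64) float64 {
	return 100 - usedPercent
}

// storeDirect keeps the header value unchanged, i.e. the direct "used %"
// semantics that main stores again after the revert. Hypothetical name.
func storeDirect(usedPercent float64) float64 {
	return usedPercent
}

func main() {
	// Value of x-codex-secondary-used-percent for a nearly unused account.
	header := 1.0
	fmt.Printf("inverted: %.0f%%\n", storeInverted(header)) // 99%
	fmt.Printf("direct:   %.0f%%\n", storeDirect(header))   // 1%
}
```

With the inverted form, the less an account has been used, the closer its stored value gets to 100, which is why fresh accounts were the ones hit.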

What this PR adds (hardening, not a re-fix)

  1. Regression test locking in the direct "used %" semantics. The semantics have flip-flopped twice (fix(usage): fix inverted used%/remaining% in the OpenAI 5h usage window #2918, then fix(usage): revert OpenAI 5h used_percent inversion (#2918 regression) #2993) with no value-level guard. The test asserts that a fresh account (secondary_used_percent=1, 5h window) stores codex_5h_used_percent=1, not 99.

  2. Stale-bounded self-heal in resolveOpenAIQuotaUtilization — the single auto-pause chokepoint feeding the scheduler, WS forwarder, and gateway. An account already poisoned with an inflated used% gets excluded from scheduling, and a paused account never receives traffic to refresh its snapshot, so today it only recovers when the window's reset_at passes (≤5h/≤7d) or an admin opens its usage page (no background refresher exists). Now, when codex_usage_updated_at is older than 2h, the account is no longer auto-paused on that snapshot; it gets one request whose response headers refresh the snapshot via the existing UpdateCodexUsageSnapshotFromHeaders path and self-heal it.
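
The stale-bounded decision in item 2 can be sketched roughly as follows. This is a simplified model under stated assumptions, not the body of resolveOpenAIQuotaUtilization: the snapshot struct, snapAged, shouldAutoPause, and the pauseAt threshold (95 in the example) are hypothetical, and only the 2h bound comes from the description.

```go
package main

import (
	"fmt"
	"time"
)

// staleAfter mirrors the 2h bound from the PR description.
const staleAfter = 2 * time.Hour

// refNow is a fixed clock so the example is deterministic.
var refNow = time.Date(2026, time.June, 5, 12, 0, 0, 0, time.UTC)

// snapshot is a hypothetical, simplified view of the stored usage fields.
type snapshot struct {
	UsedPercent float64
	UpdatedAt   time.Time // zero value stands for a missing timestamp
}

// snapAged builds a snapshot last updated ageHours before refNow.
func snapAged(used float64, ageHours int) snapshot {
	return snapshot{
		UsedPercent: used,
		UpdatedAt:   refNow.Add(-time.Duration(ageHours) * time.Hour),
	}
}

// shouldAutoPause reports whether the account stays excluded from scheduling.
// pauseAt is an assumed threshold; the real value is not stated in the PR.
func shouldAutoPause(s snapshot, pauseAt float64, now time.Time) bool {
	if s.UsedPercent < pauseAt {
		return false // below the threshold: nothing to pause
	}
	if !s.UpdatedAt.IsZero() && now.Sub(s.UpdatedAt) > staleAfter {
		return false // stale: let a request through so its headers refresh the snapshot
	}
	return true // fresh or timestamp-less exhausted snapshot stays paused
}

func main() {
	fmt.Println(shouldAutoPause(snapAged(99, 3), 95, refNow))           // false: 3h old, stale
	fmt.Println(shouldAutoPause(snapAged(99, 1), 95, refNow))           // true: 1h old, fresh
	fmt.Println(shouldAutoPause(snapshot{UsedPercent: 99}, 95, refNow)) // true: no timestamp
}
```

The stale branch only stops the pause from applying; it does not rewrite the stored value. Correction is left to the next response's headers.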

Safety

  • A missing codex_usage_updated_at is treated as fresh (account stays paused) — no timestamp-less snapshot can silently escape auto-pause. Real poisoned snapshots always carry the timestamp.
  • An actively-served, genuinely-exhausted account refreshes codex_usage_updated_at on every response, so it never crosses the 2h bound and stays paused. Guarded by a dedicated test.
  • Trade-off: a legitimately-exhausted and idle account leaks ~1 verification request per 2h — strictly better than the current reset-only recovery, and it cannot regress a recently-updated exhausted account.
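
The fail-safe defaults listed above (a missing or unreadable timestamp counts as fresh) could look roughly like this. It is a sketch, not the merged helper: isSnapshotStale and extraAged are hypothetical names, and time.Parse with time.RFC3339 stands in for the repository's own parsing helper, which is not shown in this PR page.

```go
package main

import (
	"fmt"
	"time"
)

// staleAfter mirrors the 2h bound from the PR description.
const staleAfter = 2 * time.Hour

// refNow is a fixed clock so the example is deterministic.
var refNow = time.Date(2026, time.June, 5, 12, 0, 0, 0, time.UTC)

// extraAged builds a hypothetical account "extra" map whose snapshot was
// updated the given number of hours before refNow.
func extraAged(hours int) map[string]any {
	ts := refNow.Add(-time.Duration(hours) * time.Hour).Format(time.RFC3339)
	return map[string]any{"codex_usage_updated_at": ts}
}

// isSnapshotStale returns true only when a readable timestamp is older than
// staleAfter. Missing or unparsable values count as fresh, so the account
// stays paused instead of silently escaping auto-pause.
func isSnapshotStale(extra map[string]any, now time.Time) bool {
	raw, ok := extra["codex_usage_updated_at"]
	if !ok || raw == nil {
		return false // no timestamp: treat as fresh
	}
	updatedAt, err := time.Parse(time.RFC3339, fmt.Sprint(raw))
	if err != nil {
		return false // unreadable timestamp: treat as fresh
	}
	return now.Sub(updatedAt) > staleAfter
}

func main() {
	fmt.Println(isSnapshotStale(extraAged(3), refNow))        // true
	fmt.Println(isSnapshotStale(extraAged(1), refNow))        // false
	fmt.Println(isSnapshotStale(map[string]any{}, refNow))    // false
	bad := map[string]any{"codex_usage_updated_at": "not-a-time"}
	fmt.Println(isSnapshotStale(bad, refNow))                 // false
}
```

Defaulting to "fresh" on every error path means a bug in timestamp handling can at worst delay recovery; it cannot release an exhausted account early.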

No change to Normalize(); no 100-x reintroduced; no new dependency wiring or mocks.

Tests

  • TestBuildCodexUsageExtraUpdates_FreshAccountUsedPercentNotInverted_Issue2994
  • TestOpenAIGatewayService_SelectAccountForModelWithExclusions_StaleUsageSnapshotSkipsPause_Issue2994
  • TestOpenAIGatewayService_SelectAccountForModelWithExclusions_FreshExhaustedSnapshotStillPauses_Issue2994 (guardrail)

go test -tags=unit ./internal/service/... and golangci-lint run ./internal/service/... pass.

Fixes #2994

🤖 Generated with Claude Code

…ei-Shaw#2994)

The OpenAI/Codex 5h "used %" inversion that caused fresh accounts to show
~96-99% used (PR Wei-Shaw#2918, commit b65dde6) was already reverted in Wei-Shaw#2993, so the
stored value is now the correct "used %" again. This commit hardens that fix:

1. Regression test locking in direct "used %" semantics. The semantics have
   flip-flopped twice (Wei-Shaw#2918 -> Wei-Shaw#2993) with no value-level guard — a fresh
   account (secondary_used_percent=1, 5h window) must store
   codex_5h_used_percent=1, not 99.

2. Stale-bounded self-heal in resolveOpenAIQuotaUtilization (the single
   auto-pause chokepoint). An account poisoned with an inflated used% gets
   excluded from scheduling, and a paused account never receives traffic to
   refresh its snapshot — so it stayed stuck until the window's reset_at passed
   (up to 5h/7d). When codex_usage_updated_at is older than 2h, the account is
   no longer auto-paused on that snapshot; it gets one request whose response
   headers refresh the snapshot and self-heal it. A missing timestamp is treated
   as fresh (stays paused), and an actively-served exhausted account refreshes
   the timestamp every response so it never crosses the bound — it cannot escape
   auto-pause.

No change to Normalize(); no 100-x reintroduced; no new dependency wiring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Contributor


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds regression guards and a “stale snapshot” escape hatch to prevent OpenAI Codex account auto-pause from permanently excluding accounts when usage snapshots are incorrect or no longer refreshed (issue #2994).

Changes:

  • Add a regression test ensuring 5h/7d used_percent semantics are not inverted.
  • Introduce a staleness threshold (openAICodexAutoPauseStaleAfter) so stale usage snapshots no longer keep accounts auto-paused.
  • Add scheduler tests validating stale snapshots allow one request to self-heal while fresh exhausted snapshots still pause.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| backend/internal/service/openai_gateway_service_codex_snapshot_test.go | Adds a regression test ensuring codex_5h_used_percent / codex_7d_used_percent are stored as “used%” (not inverted). |
| backend/internal/service/openai_gateway_service.go | Adds staleness-based bypass for auto-pause and helper to detect stale snapshots via codex_usage_updated_at. |
| backend/internal/service/openai_account_scheduler_test.go | Adds tests covering stale-snapshot bypass and ensuring fresh exhausted snapshots remain paused. |


Comment on lines +931 to +933
"codex_5h_reset_at": time.Now().Add(time.Hour).Format(time.RFC3339),
// Snapshot is stale: older than openAICodexAutoPauseStaleAfter (2h).
"codex_usage_updated_at": time.Now().Add(-3 * time.Hour).Format(time.RFC3339),
Comment on lines +1514 to +1517
updatedAt, err := parseTime(fmt.Sprint(updatedRaw))
if err != nil {
	return false
}
@Wei-Shaw Wei-Shaw merged commit 83cce85 into Wei-Shaw:main Jun 5, 2026
6 of 7 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 5, 2026


Development

Successfully merging this pull request may close these issues.

GPT OAuth usage window statistics issue

3 participants